One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of the emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". In comparison to state-of-the-art Single-Task Learning (STL) methods, our proposed MTL method improved performance significantly. In particular, models using both gender and naturalness achieved larger gains than those using either gender or naturalness alone. This benefit was also visible in the high-level representations of the feature space obtained from our proposed method, in which discriminative emotional clusters could be observed.
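The general shape of such an MTL setup, a shared trunk feeding one main (emotion) head and two auxiliary heads (gender, naturalness), can be sketched as below. This is a minimal illustrative forward pass in NumPy, not the paper's actual architecture: all dimensions, the number of emotion classes, and the auxiliary loss weight `aux_weight` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: acoustic feature vector -> shared hidden layer.
n_features, n_hidden = 40, 64
n_emotions = 4  # placeholder number of emotion classes

# Shared trunk weights: the representation all three tasks use.
W_shared = rng.normal(scale=0.1, size=(n_features, n_hidden))

# Task-specific heads: main task (emotion) plus the two auxiliary tasks.
W_emotion = rng.normal(scale=0.1, size=(n_hidden, n_emotions))
W_gender  = rng.normal(scale=0.1, size=(n_hidden, 2))  # e.g. female/male
W_natural = rng.normal(scale=0.1, size=(n_hidden, 2))  # e.g. acted/natural

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    # One shared nonlinear representation, three task-specific softmax heads.
    h = np.tanh(x @ W_shared)
    return (softmax(h @ W_emotion),
            softmax(h @ W_gender),
            softmax(h @ W_natural))

def mtl_loss(probs, targets, aux_weight=0.5):
    # Cross-entropy on the main task plus down-weighted auxiliary losses.
    def ce(p, t):
        return -np.log(p[np.arange(len(t)), t] + 1e-12).mean()
    (p_e, p_g, p_n), (t_e, t_g, t_n) = probs, targets
    return ce(p_e, t_e) + aux_weight * (ce(p_g, t_g) + ce(p_n, t_n))

# Toy batch of 8 utterance-level feature vectors with dummy labels.
x = rng.normal(size=(8, n_features))
probs = forward(x)
loss = mtl_loss(probs, (np.zeros(8, int), np.ones(8, int), np.zeros(8, int)))
```

The intuition behind the design is that gradients from the auxiliary heads regularise the shared trunk, encouraging representations that encode speaker and corpus attributes rather than overfitting the training speakers.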